library(tidyverse)
Download the example file from COSMIC. This is an archive of files. Each has 100 lines of the the full files (that you have to register to access)
download.file("https://cog.sanger.ac.uk/cosmic/taster/example_grch38_cosmic.tar",destfile = "example_grch38_cosmic.tar")
Use R command to extract files from the archive
untar(tarfile = "example_grch38_cosmic.tar",exdir="cosmic_example")
mutants <- read_tsv("cosmic_example/CosmicMutantExport.tsv")
Parsed with column specification:
cols(
.default = col_character(),
`Gene CDS length` = col_double(),
`HGNC ID` = col_double(),
ID_sample = col_double(),
ID_tumour = col_double(),
GRCh = col_double(),
`FATHMM score` = col_double(),
Pubmed_PMID = col_double(),
ID_STUDY = col_double(),
Age = col_double()
)
See spec(...) for full column specifications.
mutants
Selecting columnsWe can use the select function to print only the columns we are interested in. Column names with a space need to have single quote marks around them ```
select(mutants, `Gene name`, `Primary site`, `Mutation ID`)
select(mutants, `Gene name`:`Sample name`)
select(mutants, contains("Mutation"))
We notice that information on the genomic location is contained in one column; whereas we might like columns for the chromosome, start and end.
mutants <- tidyr::separate(mutants, `Mutation genome position`, into=c("Chromosome","Start","End"))
filtering the rowsfilter(mutants, `FATHMM prediction`=="PATHOGENIC")
filter(mutants, `Gene name`=="CMYA5")
Data can be sorted according to column with the arrange function. We try and sort the rows by chromosome, but there is a problem with the output
arrange(mutants, Chromosome)
mutants <- mutate(mutants, Chromosome = as.numeric(Chromosome))
arrange(mutants, Chromosome, desc(Start))
Finding all lung cancer mutations that are pathogenic
filter(mutants, `Primary site` == "lung",`FATHMM prediction` == "PATHOGENIC") %>%
arrange(Chromosome, Start) %>%
dplyr::select(`Gene name`, `Mutation ID`, `Mutation CDS`)
ggplot(mutants, aes(x=`FATHMM prediction`)) + geom_bar()
Accessing the full dataset requires registration with COSMIC. Assuming you have completed that process you should be able to download a file CosmicMutantExportConsensus.tsv.gz containing all mutants in the Cancer Gene Consensus. It is around 40MB so takes a little bit of time to read into R.
(even though the file is compressed (.gz) we can still read it into R)
consensus_mutants <- read_tsv("CosmicMutantExportCensus.tsv.gz")
Parsed with column specification:
cols(
.default = col_character(),
`Gene CDS length` = col_double(),
ID_sample = col_double(),
ID_tumour = col_double(),
GRCh = col_double(),
`FATHMM score` = col_double(),
Pubmed_PMID = col_double(),
ID_STUDY = col_double(),
Age = col_double(),
Tier = col_double()
)
See spec(...) for full column specifications.
The data frame object now has 700K rows
consensus_mutants
A useful tool is count that creates a table of how many genes are found in the file
count(consensus_mutants, `Gene name`)
What mutations are recorded for TP53?
filter(consensus_mutants, `Gene name`=="TP53")
What is the most commonly mutated gene in breast cancer?
filter(consensus_mutants, `Primary site`=="breast") %>%
count(`Gene name`) %>%
arrange(desc(n))
What cancer types is TP53 most commonly mutated in?
filter(consensus_mutants, `Gene name`=="TP53") %>%
ggplot(aes(x=`Primary site`)) + geom_bar()
There are a few too many sites to see the plot properly, so we can further filter and also flip the axis so it is easier to read
filter(consensus_mutants, `Gene name`=="TP53") %>%
count(`Primary site`) %>%
filter(n > 1000) %>%
ggplot(aes(x=`Primary site`,y=n)) + geom_bar(stat="identity") + coord_flip()
Do TP53 mutations have a different age distribution in different cancer types?
filter(consensus_mutants, `Gene name`=="TP53") %>%
ggplot(aes(x=`Primary site`,y=Age)) + geom_boxplot() + coord_flip()
We will use the somatic calls that have previously been made for the HCC1143 breast cancer cell line. The calls were made as part of a Cancer Research Uk Bioinformatics Summer School.
The following workflow will be used:-
VariantAnnotationTxDb.... to annotate variants within coding regionsBsgenome.... to assess functionorg.Hs.eg.db to convert to recognisable gene symbolsdplyr and use the functions from the previous sectionurl <- "https://github.com/bioinformatics-core-shared-training/cruk-summer-school-2016/raw/master/Day3/HCC1143_vs_HCC1143_BL.flagged.muts.vcf"
download.file(url, destfile="HCC1143_vs_HCC1143_BL.flagged.muts.vcf")
The VariantAnnotation package can be used to import the data into R. This package is distributed as part of the Bioconductor project.
library(VariantAnnotation)
raw_vcf <- readVcf("HCC1143_vs_HCC1143_BL.flagged.muts.vcf")
To make the process go more smoothly, we will consider variants on chromosomes 1 to 22 only. The naming convention of the chromosomes also needs to be changed to be compatible with the annotation resources we are going to use.
raw_vcf <- keepSeqlevels(raw_vcf, 1:22,pruning.mode = "coarse")
seqlevelsStyle(raw_vcf) <- "UCSC"
genome(raw_vcf) <- "hg19"
The location of the variants can be returned using the rowRanges function that is part of VariantAnnotation.
var_locs <- rowRanges(raw_vcf)
var_locs
GRanges object with 87753 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID REF ALT QUAL FILTER
<Rle> <IRanges> <Rle> | <factor> <DNAStringSet> <DNAStringSetList> <numeric> <character>
6c54cf8c-3ee9-11e6-9d63-a89a8835fc03 chr1 16103 * | <NA> T G <NA> VUM;MQ;MN
6c54d310-3ee9-11e6-9d63-a89a8835fc03 chr1 17222 * | <NA> A G <NA> MNP;VUM;MQ;MN
6c54d3ce-3ee9-11e6-9d63-a89a8835fc03 chr1 74570 * | <NA> T A <NA> MQ;VUM
6c54d46e-3ee9-11e6-9d63-a89a8835fc03 chr1 101445 * | <NA> G C <NA> MQ;MNP
6c54d504-3ee9-11e6-9d63-a89a8835fc03 chr1 101456 * | <NA> A G <NA> MNP;MQ
... ... ... ... . ... ... ... ... ...
6dc4437a-3ee9-11e6-9d63-a89a8835fc03 chr22 51192990 * | <NA> C A <NA> MNP
6dc44406-3ee9-11e6-9d63-a89a8835fc03 chr22 51238839 * | <NA> C T <NA> MQ
6dc44492-3ee9-11e6-9d63-a89a8835fc03 chr22 51238846 * | <NA> G A <NA> RP;MQ
6dc44528-3ee9-11e6-9d63-a89a8835fc03 chr22 51239025 * | <NA> C T <NA> MQ;MNP;MN
6dc445b4-3ee9-11e6-9d63-a89a8835fc03 chr22 51239040 * | <NA> G A <NA> MN;MQ
-------
seqinfo: 22 sequences from hg19 genome
We notice that the vcf file already has some filters applied for quality. Variants that pass all the filters have an entry of PASS. We will use these variants in subsequent analyses.
N.B. the data aren’t yet in a form that we can use the dplyr operations from previously, so we have to use a different approach to filter the rows.
pass_vcf <- raw_vcf[var_locs$FILTER == "PASS"]
pass_locs <- rowRanges(pass_vcf)
The Bioconductor project also distributes several annotation packages that can be used to annotate our results. The collection of TxDb.... packages comprise definitions of transcripts for a given genome build.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
The locateVariants function can be used to find the variants. Various types of match including CodingVariants that will find variants inside coding regions. The output is converted to a data frame so we can use the dplyr functions we saw earlier
We clean-up the output so only certain columns are displayed. The QUERYID column needs to be kept to refer back to the complete set of variants in the filtered vcf file.
var_annotations <- locateVariants(pass_locs, txdb, CodingVariants())
GRanges object contains 653 out-of-bound ranges located on sequences 7, 8, 220, 4318, 4319, 4320, 4321, 4323, 4708, 5800, 6512, 2803, 3144, 7538, 7539, 7930, 7931, 11490, 8897, 9119,
9249, 12212, 12213, 12485, 12590, 12833, 10288, 13104, 14225, 14387, 22341, 20690, 20691, 23089, 21780, 21788, 21789, 21790, 21791, 27655, 30453, 29152, 29153, 29159, 31222, 29398,
33491, 33492, 33493, 34133, 34135, 34154, 32887, 35157, 35796, 37617, 39458, 39459, 39460, 39461, 39544, 39545, 39546, 39547, 39548, 39549, 39794, 39795, 40302, 40716, 40717, 40718,
40719, 40721, 40722, 40723, 40724, 40726, 40727, 44317, 42208, 42209, 44941, 44942, 45684, 45685, 48678, 48720, 48721, 49165, 49404, 50849, 52689, 51422, 51423, 51590, 53239, 53246,
53247, 53248, 53497, 53498, 53540, 53541, 53542, 53543, 53544, 53865, 53866, 55771, 55772, 56334, 56337, 56698, 56701, 56702, 56703, 56704, 56705, 57211, 58018, 58020, 64269, 64270,
64271, 64272, 64273, 64274, 65311, 65112, 65813, 65903, 66184, 69031, 69032, 66684, 66687, 67037, 67691, 67692, 67693, 67694, 70538, 70539, 73793, 73797, 73798, 73799, and 74949. Note
that ranges located on a sequence whose length is unknown (NA) or on a circular sequence are not considered out-of-bound (use seqlengths() and isCircular() to get the lengths and
circularity flags of the underlying sequences). You can use trim() to trim these ranges. See ?`trim,GenomicRanges-method` for more information.
var_annotations <- as.data.frame(var_annotations) %>%
dplyr::select(QUERYID,GENEID) %>% unique %>%
filter(!is.na(GENEID))
var_annotations
With the predictCoding function we can assess whether the mutations are going to result in amino acid change or not. This makes use of the Genome sequence, which is available in the BSgenome.Hsapiens.UCSC.hg19 package.
Again we only use certain columns from the output.
library(BSgenome.Hsapiens.UCSC.hg19)
var_predictions <- predictCoding(pass_vcf, txdb, seqSource = Hsapiens)
GRanges object contains 653 out-of-bound ranges located on sequences 7, 8, 220, 4318, 4319, 4320, 4321, 4323, 4708, 5800, 6512, 2803, 3144, 7538, 7539, 7930, 7931, 11490, 8897, 9119,
9249, 12212, 12213, 12485, 12590, 12833, 10288, 13104, 14225, 14387, 22341, 20690, 20691, 23089, 21780, 21788, 21789, 21790, 21791, 27655, 30453, 29152, 29153, 29159, 31222, 29398,
33491, 33492, 33493, 34133, 34135, 34154, 32887, 35157, 35796, 37617, 39458, 39459, 39460, 39461, 39544, 39545, 39546, 39547, 39548, 39549, 39794, 39795, 40302, 40716, 40717, 40718,
40719, 40721, 40722, 40723, 40724, 40726, 40727, 44317, 42208, 42209, 44941, 44942, 45684, 45685, 48678, 48720, 48721, 49165, 49404, 50849, 52689, 51422, 51423, 51590, 53239, 53246,
53247, 53248, 53497, 53498, 53540, 53541, 53542, 53543, 53544, 53865, 53866, 55771, 55772, 56334, 56337, 56698, 56701, 56702, 56703, 56704, 56705, 57211, 58018, 58020, 64269, 64270,
64271, 64272, 64273, 64274, 65311, 65112, 65813, 65903, 66184, 69031, 69032, 66684, 66687, 67037, 67691, 67692, 67693, 67694, 70538, 70539, 73793, 73797, 73798, 73799, and 74949. Note
that ranges located on a sequence whose length is unknown (NA) or on a circular sequence are not considered out-of-bound (use seqlengths() and isCircular() to get the lengths and
circularity flags of the underlying sequences). You can use trim() to trim these ranges. See ?`trim,GenomicRanges-method` for more information.
names(var_predictions) <- make.names(names(var_predictions),unique = TRUE)
var_predictions <- as.data.frame(var_predictions) %>%
select(QUERYID,CONSEQUENCE,REFCODON,VARCODON,REFAA,VARAA) %>% unique
var_predictions
Now the data are in data frames, we can join the two tables together with dplyr.
var_annotations <- left_join(var_annotations,var_predictions,by="QUERYID")
var_annotations
The gene names are currently Entrez ID, which although useful for mapping, do not give particularly memorable names that we can search for. We use a pre-built annotation package org.Hs.eg.db to obtain gene symbols.
library(org.Hs.eg.db)
gene_anno <- AnnotationDbi::select(org.Hs.eg.db,
columns="SYMBOL",
keys=var_annotations$GENEID,
keytype = "ENTREZID")
'select()' returned many:1 mapping between keys and columns
gene_anno <- rename(gene_anno, "GENEID"=ENTREZID)
var_annotations <- left_join(var_annotations, gene_anno) %>% unique
Joining, by = "GENEID"
var_annotations
We can put everything together into one table. The locations from the vcf need to be converted into a data frame including a QUERYID column that can be matched to the annotations. This is defined to be the row number.
final_tab <- as.data.frame(pass_locs) %>%
mutate(QUERYID = 1:n()) %>% select(-width,-paramRangeID,-ALT,-QUAL,-FILTER,-strand) %>%
inner_join(var_annotations, by="QUERYID") %>%
select(seqnames:end,SYMBOL,REF:VARAA,-QUERYID)
final_tab
The dplyr functions and ggplot2 introduced in the previous section can now be applied to our list of variants.
filter(final_tab, CONSEQUENCE=="nonsense")
What genes are most commonly mutated?
count(final_tab, SYMBOL) %>% arrange(desc(n))
How many variants are located on each chromosome?
ggplot(final_tab, aes(x=seqnames,fill=CONSEQUENCE)) + geom_bar(position="dodge") + coord_flip()